Team Accuracy

Group Members:

Table of Contents

Introduction

Our research aims to use a healthcare stroke dataset to make informed decisions about an individual's risk of suffering from a stroke. We evaluate multiple health features such as gender, age, BMI, glucose levels, hypertension, and heart disease to determine the extent to which they should be weighted when providing health recommendations to at-risk individuals and when evaluating the risk portfolio of life insurance policies.

Stroke is a critical health condition that affects millions of people globally and is the 5th leading cause of death and the primary cause of disability in the United States [1]. However, it can be prevented through lifestyle changes such as losing weight, quitting smoking, and controlling cholesterol, as suggested by the American Stroke Association [2].

The dataset comprises anonymized patient records from the United States and aims to establish connections between factors such as stroke, BMI, and smoking status. To uncover any regional health trends, we recommend supplementing the data with geographic location information in future collections.

The information derived from this research will be of great value to multiple stakeholders, including health counselors and life insurance adjusters. Health counselors can use this data to identify risk factors in their patients' lives and help them avoid serious illnesses like stroke. Life insurance adjusters, on the other hand, can use the data to offer customized premiums based on an individual's risk profile.

In conclusion, our research will be beneficial to both health counselors and life insurance adjusters by providing them with valuable insights into the risk of stroke and allowing them to make informed decisions about health and life insurance premiums. By improving the overall health of individuals, this research will also benefit the healthcare industry as a whole.

Business Understanding

1.1 Why is this data analysis important?

  In this section we want to answer the following questions:

  According to the World Health Organization (WHO), stroke is the 2nd leading cause of death globally, responsible for approximately 11% of total deaths. This dataset is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row in the data provides relevant information about a patient.

  Analyzing this data can be helpful in pre-diagnosis and in preventing stroke and its consequences, such as speech and language problems, difficulty swallowing, memory loss, and partial paralysis, some of which are irreversible. We believe that if the medical team can create a database using the aforementioned features, it can provide insights that are effective in preventing stroke and monitoring patients' health. This analysis can also help with offering insurance to patients who are more likely to have a stroke, so they can afford the expenses.

  As with all other medical information, the analysis of this dataset must be highly accurate; a correct result can be a lifesaver, especially for individuals wrongly classified as not being at risk.


Dataset: Stroke Prediction Dataset URL: https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset

Question Of Interest: Which type of life insurance premium/cost structure works best for clients?

Data Understanding

2.1 Data Description

Loading the data and the necessary library packages
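The loading step might look like the sketch below. The Kaggle file is typically named healthcare-dataset-stroke-data.csv (an assumption; adjust to your download path), and a tiny inline sample with the same columns stands in for it here so the snippet is self-contained:

```python
import io
import pandas as pd

# Two illustrative rows standing in for the downloaded Kaggle CSV.
sample_csv = io.StringIO(
    "id,gender,age,hypertension,heart_disease,ever_married,work_type,"
    "Residence_type,avg_glucose_level,bmi,smoking_status,stroke\n"
    "9046,Male,67,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1\n"
    "51676,Female,61,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1\n"
)

# In the project this would be:
# df = pd.read_csv('healthcare-dataset-stroke-data.csv')
df = pd.read_csv(sample_csv)
print(df.shape)  # (2, 12)
```

Note that the empty bmi field in the second row is read in as NaN, which is the missingness we handle later.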

Defining our attributes, their type and description

Attribute Attr. type Description
Gender Nominal Sex of individual surveyed. Male or female.
Age Ratio Age of individual surveyed.
Hypertension Nominal Individual reported either having or not having hypertension.
Heart Disease Nominal Individual reported either having or not having heart disease.
Married Nominal Individual reported being currently or previously married.
Work Type Nominal Individual reported employment status (private, self-employed).
Residence Type Nominal Individual reported type of residence (urban, rural).
AVG Glucose Level Ratio Individual reported average glucose level.
BMI Ratio Individual reported BMI.
Smoking Status Ordinal Individual reported smoking status (Unknown, Never Smoked, Formerly Smoked, Smokes).
Stroke Nominal Individual reported either having or not having a stroke.

2.2 Data Cleaning & Quality Checking

1. Checking duplicates & cleaning unnecessary data

Compare the length of the list of IDs with the length of the corresponding set to check whether the dataset contains duplicates.
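One way to sketch this check (the ids here are illustrative, not from the real dataset):

```python
import pandas as pd

# Illustrative frame with one duplicated id.
df = pd.DataFrame({'id': [1, 2, 2, 3], 'age': [40, 55, 55, 61]})

# A set discards repeats, so unequal lengths mean duplicates exist.
has_duplicates = len(list(df['id'])) != len(set(df['id']))
print(has_duplicates)  # True here because id 2 appears twice
```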


The id column is removed because it does not contain any useful information.
Only one 'Other' entry appears in the gender column; since a single record cannot be meaningfully compared with the rest of the data, it is removed.
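These two cleaning steps can be sketched as follows (toy values; the real dataset has many more rows and columns):

```python
import pandas as pd

# Hypothetical subset of the dataset.
df = pd.DataFrame({
    'id': [1, 2, 3],
    'gender': ['Male', 'Female', 'Other'],
    'age': [67, 61, 30],
})

df = df.drop(columns=['id'])      # id carries no predictive information
df = df[df['gender'] != 'Other']  # drop the single 'Other' record
print(df['gender'].tolist())      # ['Male', 'Female']
```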

2. How our features relate to the target

In this survey, we are investigating answers to the following questions:

Based on our initial assumptions, we believe:

Now let's analyze the data and see how close our initial assumptions are to the tested results.

For this analysis, we first need to replace categorical features with numerical indicators and search for any missing data.

3. Replacing categorical features with numerical indicators
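The encoding might be sketched as below. The specific integer mappings are an assumption, not taken from the dataset itself; any consistent scheme works, though for the ordinal smoking_status feature it is worth keeping an explicit order:

```python
import pandas as pd

# Toy frame with the categorical columns (hypothetical values).
df = pd.DataFrame({
    'gender': ['Male', 'Female', 'Female'],
    'ever_married': ['Yes', 'No', 'Yes'],
    'smoking_status': ['never smoked', 'smokes', 'formerly smoked'],
})

# Nominal features: any two-level mapping.
df['gender'] = df['gender'].map({'Male': 0, 'Female': 1})
df['ever_married'] = df['ever_married'].map({'No': 0, 'Yes': 1})

# Ordinal feature: preserve the ordering from Unknown to Smokes.
smoke_order = {'Unknown': -1, 'never smoked': 0,
               'formerly smoked': 1, 'smokes': 2}
df['smoking_status'] = df['smoking_status'].map(smoke_order)
print(df['smoking_status'].tolist())  # [0, 2, 1]
```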


Based on the dataframe information, missing values are found in the bmi and smoking_status columns. We believe these values might be missing because some patients did not know how to calculate their BMI, or did not want to share their BMI or smoking status.
The missing values in the bmi column can be filled by imputation. The smoking_status variable is missing about 30% of its values; that might be too many to impute, but for educational purposes we decided to impute it, and if our results do not make sense we will eliminate the variable. Our data also shows that dropping the NaN rows would completely change the summary statistics, so for now we stick to the imputation plan.

So we think it is better to keep the rows containing NaN values and fill them based on similar (neighboring) records.


We will use mean imputation for the numerical feature (bmi) and mode imputation for the categorical feature (smoking_status).
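As a baseline, this plan amounts to a simple global fillna; the sketch below uses a toy frame with illustrative values:

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value in each target column.
df = pd.DataFrame({
    'bmi': [25.0, np.nan, 30.0],
    'smoking_status': ['never smoked', None, 'never smoked'],
})

# Mean for the numerical feature, mode for the categorical one.
df['bmi'] = df['bmi'].fillna(df['bmi'].mean())
df['smoking_status'] = df['smoking_status'].fillna(df['smoking_status'].mode()[0])
print(df['bmi'].tolist())  # [25.0, 27.5, 30.0]
```

The grouped and nearest-neighbor variants in the next section refine this by borrowing statistics only from similar patients rather than from the whole column.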

2.3 Imputation Techniques

Let's try two methods of imputation on the bmi and smoking_status variables to see which one works better:

Split-Impute-Combine in Pandas

Impute missing smoking_status values using the mode within groups defined by 'age', 'hypertension', 'heart_disease', 'work_type', 'Residence_type', and 'stroke': fill the missing values inside each group, then transform back to the full dataset.
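A sketch of this split-impute-combine step is below. For readability the toy frame groups on only two of the keys ('work_type' and 'stroke'); the project uses the full key list above:

```python
import pandas as pd

# Toy frame with one missing smoking_status value.
df = pd.DataFrame({
    'work_type': ['Private', 'Private', 'Private', 'Self-employed'],
    'stroke': [0, 0, 0, 1],
    'smoking_status': ['never smoked', None, 'never smoked', 'smokes'],
})

def fill_with_group_mode(s):
    # Fill NaNs with the group's most frequent value, if one exists.
    mode = s.mode()
    return s.fillna(mode.iloc[0]) if not mode.empty else s

df['smoking_status'] = (
    df.groupby(['work_type', 'stroke'])['smoking_status']
      .transform(fill_with_group_mode)
)
print(df['smoking_status'].tolist())
# ['never smoked', 'never smoked', 'never smoked', 'smokes']
```

The guard on an empty mode matters: a group whose smoking_status values are all missing has no mode, and is left untouched.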

Impute missing bmi values using the mean within groups defined by 'age', 'hypertension', 'heart_disease', 'work_type', and 'stroke': fill the missing values inside each group, then transform back to the full dataset.
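The numerical counterpart looks similar, with the group mean replacing the mode. Again the toy frame groups on a single key ('work_type') for readability; the project uses the full key list above:

```python
import numpy as np
import pandas as pd

# Toy frame with one missing bmi value.
df = pd.DataFrame({
    'work_type': ['Private', 'Private', 'Self-employed'],
    'bmi': [24.0, np.nan, 30.0],
})

# Fill each group's NaNs with that group's mean bmi.
df['bmi'] = df.groupby('work_type')['bmi'].transform(lambda s: s.fillna(s.mean()))
print(df['bmi'].tolist())  # [24.0, 24.0, 30.0]
```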

Nearest Neighbor Imputation with Scikit-learn

Now let's fill in the bmi and smoking_status variables by selecting the 3 nearest data points to each observation.
First, we need to normalize our data.
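A minimal sketch with scikit-learn's KNNImputer, assuming the categorical columns have already been numerically encoded (KNNImputer only accepts numeric input). The toy matrix below holds just two illustrative columns, age and bmi; normalizing first keeps features on comparable scales so no single column dominates the neighbor distances:

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler

# Toy numeric matrix, columns: age, bmi (one bmi value missing).
X = np.array([
    [40.0, 24.0],
    [42.0, np.nan],
    [44.0, 26.0],
    [70.0, 35.0],
])

# Normalize to [0, 1]; MinMaxScaler ignores NaNs when fitting
# and passes them through, so the imputer still sees them.
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# Fill each NaN with the mean over the 3 nearest rows
# (distances computed on the non-missing features).
imputer = KNNImputer(n_neighbors=3)
X_imputed = scaler.inverse_transform(imputer.fit_transform(X_scaled))
print(X_imputed[1, 1])  # ≈ 28.33, the mean bmi of the 3 nearest rows
```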